Text Categorisation of Racist Texts Using a Support Vector Machine

Authors

  • Edel P. Greevy
  • Alan F. Smeaton
Abstract

The automatic processing of text is a major challenge because of the increasing availability of textual information and the need to organise and manage such information effectively and efficiently. Automatic Text Categorisation is one of a number of functions we would like to have available to us; it involves assigning one or more predefined categories to text documents so that they can be effectively managed. In this paper we examine the problems associated with categorising text documents (web pages) based on whether or not they are racist. We describe work in the PRINCIP project, which aims to develop a system that detects racism based on the results of linguistic and statistical analysis of candidate texts. We take what we have learned from the PRINCIP research and apply machine learning techniques, specifically Support Vector Machines, to automatically categorise web pages. Our work shows that it is possible to develop automatic categorisation of web pages based on these approaches.
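
The abstract does not give implementation details, so the following is only a minimal sketch of the general technique it names: binary text categorisation with a linear Support Vector Machine. It is not the authors' system. The toy documents, the neutral placeholder categories (+1 and -1), the raw term-count features, the sub-gradient training loop, and all names and hyperparameters (`train_linear_svm`, `lam`, `eta`, `epochs`) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy training set; +1 / -1 are neutral placeholder categories.
TRAIN = [
    ("team won football match", 1),
    ("late goal decided match", 1),
    ("bake bread hot oven", -1),
    ("stir soup add salt", -1),
]

def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

# Vocabulary built from the training documents only.
VOCAB = sorted({tok for text, _ in TRAIN for tok in tokenize(text)})

def vectorize(text):
    """Map a document to a term-count vector; unseen words are ignored."""
    counts = Counter(tokenize(text))
    return [float(counts[term]) for term in VOCAB]

def score(w, b, x):
    """Signed distance-like score w.x + b of a feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=50):
    """Sub-gradient descent on the L2-regularised hinge loss (assumed setup)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * score(w, b, xi)
            # Regularisation step: shrink the weights slightly.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss step: only examples inside the margin move w and b.
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

X = [vectorize(text) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
w, b = train_linear_svm(X, y)

def predict(text):
    """Assign +1 or -1 according to the side of the learned hyperplane."""
    return 1 if score(w, b, vectorize(text)) >= 0.0 else -1

# Two unseen toy documents, one per placeholder category.
predictions = [predict("team scored goal"), predict("add salt bread")]
print(predictions)
```

A real categoriser would normally rely on an established SVM library, weighted features, and a labelled corpus of web pages; the loop above is written out by hand only to keep the sketch self-contained.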

Similar resources

An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, their organization and management require a tool that delivers the information users are searching for in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing, and especially in text processing, one of the most basic tasks is automatic text classification. Moreover, text ...

Emotion Detection in Persian Text; A Machine Learning Model

This study aimed to develop a computational model for recognising emotion in Persian text as a supervised machine learning problem. We used Plutchik's emotion model as the supervised learning criterion and a Support Vector Machine (SVM) as the baseline classifier. We also used the NRC lexicon and contextual features as training data and components of the model. One hundred selected texts including pol...

Determining the Boundary and Type of Syntactic Phrases in Persian Texts

Text tokenization is the process of splitting text into meaningful tokens such as words, phrases, sentences, etc. Tokenization into syntactic phrases, known as chunking, is an important preprocessing step needed in many applications such as machine translation, information retrieval, text to speech, etc. In this paper, chunking of Farsi texts is done using statistical and learning methods and the grammat...

SVM Categorizer: A Generic Categorization Tool Using Support Vector Machines

Supervised text categorisation is a significant tool considering the vast amount of structured, unstructured, or semi-structured texts that are available from internal or external enterprise resources. The goal of supervised text categorisation is to assign text documents to finite pre-specified categories in order to extract and automatically organise information coming from these resources...

Deep Belief Networks and Biomedical Text Categorisation

We evaluate the use of Deep Belief Networks as classifiers in a text categorisation task (assigning category labels to documents) in the biomedical domain. Our preliminary results indicate that compared to Support Vector Machines, Deep Belief Networks are superior when a large set of training examples is available, showing an F-score increase of up to 5%. In addition, the training times for DBN...

Publication date: 2004